Searching the Referentially-compressed Genomes by Incomplete Patterns
Abstract
Genome banks contain valuable biological information, most of which has not yet been discovered. Biologists are therefore keen to explore these banks precisely in order to find significant patterns (such as motifs and retrotransposons) that have a real impact on the function and evolution of living organisms. Because modern genome sequencing technologies produce genomes at high throughput, many techniques have emerged to store genomes in as little space as possible. Reference-based Compression algorithms (RbCs) compress sequenced genomes efficiently by storing mainly their differences with respect to a reference genome. RbCs therefore achieve very high compression ratios compared to traditional compression algorithms. However, to search a compressed genome for specific patterns, it must first be completely decompressed, which wastes both time and storage. This paper introduces a method for searching for either exact or incomplete patterns inside referentially compressed genomes without decompressing them completely. The search methodology works by immediately searching the successive sequences that are partially decompressed from the compressed genome. Moreover, a single search process can look for multiple patterns simultaneously, saving further resources. The experimental results showed noticeable performance gains compared to the traditional approach of searching the same compressed genomes after their complete referential decompression.
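The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the compressed format (a list of `("copy", start, length)` and `("lit", text)` entries), the function names, and the reading of an "incomplete pattern" as a pattern in which `N` matches any base are all assumptions made for illustration. The sketch decompresses one entry at a time, searches it immediately for several patterns at once, and keeps only the short tail needed to catch matches that span entry boundaries.

```python
from typing import Dict, Iterator, List, Sequence, Tuple

# Hypothetical compressed entry: ("copy", ref_start, length) or ("lit", text).
Entry = Tuple


def decompress_chunks(reference: str, entries: Sequence[Entry]) -> Iterator[str]:
    """Yield the genome piece by piece instead of building it whole."""
    for entry in entries:
        if entry[0] == "copy":
            _, start, length = entry
            yield reference[start:start + length]
        else:
            yield entry[1]


def matches(segment: str, pattern: str, wildcard: str) -> bool:
    """Compare position by position; the wildcard (assumed) matches any base."""
    return all(p == wildcard or p == s for p, s in zip(pattern, segment))


def search_compressed(reference: str, entries: Sequence[Entry],
                      patterns: Sequence[str],
                      wildcard: str = "N") -> Dict[str, List[int]]:
    """Find all patterns in one pass over the partially decompressed chunks."""
    hits: Dict[str, List[int]] = {p: [] for p in patterns}
    next_start = {p: 0 for p in patterns}  # first unchecked genome position
    window = ""   # decompressed text still needed for pending comparisons
    offset = 0    # genome position of window[0]
    for chunk in decompress_chunks(reference, entries):
        window += chunk
        total = offset + len(window)  # bases decompressed so far
        for p in patterns:
            last = total - len(p) + 1  # one past the last start that fits
            for s in range(next_start[p], last):
                i = s - offset
                if matches(window[i:i + len(p)], p, wildcard):
                    hits[p].append(s)
            next_start[p] = max(next_start[p], last)
        # Drop everything no pattern can still start in.
        new_offset = min(next_start.values())
        window = window[new_offset - offset:]
        offset = new_offset
    return hits


reference = "ACGTACGTTTGACA"
entries = [("copy", 0, 6), ("lit", "GG"), ("copy", 8, 6)]
result = search_compressed(reference, entries, ["ACGG", "GNT", "ACA"])
print(result)  # {'ACGG': [4], 'GNT': [6, 7], 'ACA': [11]}
```

In this sketch the retained window never exceeds the longest pattern length minus one plus the size of the current chunk, so memory use depends on the entry sizes rather than on the genome length. The match for `ACGG` at position 4 spans the boundary between the first copied block and the literal `GG`, which is the case the carried-over tail exists to handle.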
Similar resources
String Searching in Referentially Compressed Genomes
Background: Improved sequencing techniques have led to large amounts of biological sequence data. One of the challenges in managing sequence data is efficient storage. Recently, referential compression schemes, storing only the differences between a to-be-compressed input and a known reference sequence, gained a lot of interest in this field. However, so far sequences always have to be decompres...
QGramProjector: Q-Gram Projection for Indexing Highly-Similar Strings
Q-gram (or n-gram, k-mer) models are used in many research areas, e.g. in computational linguistics for statistical natural language processing, in computer science for approximate string searching, and in computational biology for sequence analysis and data compression. For a collection of N strings, one usually creates a separate positional q-gram index structure for each string, or at least ...
An Algorithm for Browsing the Referentially-compressed Genomes
Genome resequencing produces an enormous amount of data daily. Biologists need to frequently mine this data with the provided processing and storage resources. Therefore, it becomes very critical to professionally store this data in order to efficiently browse it in a frequent manner. Reference-based Compression algorithms (RbCs) showed significant genome compression results compared to the tradit...
RCSI: Scalable similarity search in thousand(s) of genomes
Until recently, genomics has concentrated on comparing sequences between species. However, due to the sharply falling cost of sequencing technology, studies of populations of individuals of the same species are now feasible and promise advances in areas such as personalized medicine and treatment of genetic diseases. A core operation in such studies is read mapping, i.e., finding all parts of a...
Improving Exact Search of Multiple Patterns From a Compressed Suffix Array
Self-indexes are largely studied and widely applied structures in string matching. However, the exact matching of multiple patterns using self-indexes is a topic that has not been the subject of concentrated study although it is an area that may have direct and indirect applications and uses in fields such as bioinformatics. This paper presents a method of improving the exact search of multiple...
Publication date: 2014